Craftcans.com - cleaning

Craftcans.com provides a database of 2692 crafted canned beers. The data on beers includes the following variables:

  • Name
  • Style
  • Size
  • Alcohol by volume (ABV)
  • IBU’s
  • Brewer name
  • Brewer location

However, some of the variables include both number and text values (e.g. Size), while others include missing values (e.g. IBUs). In order to make the dataset ready for analysis, one needs to clean it first. We will do that using pandas and regular expressions.


In [2]:
import pandas, re

In [3]:
data = pandas.read_excel("craftcans.xlsx")

In [4]:
data.head()


Out[4]:
0 1 2 3 4 5 6 7
0 ENTRY BEER BREWERY LOCATION STYLE SIZE ABV IBUs
1 2692. Get Together NorthGate Brewing Minneapolis, MN American IPA 16 oz. 4.5% 50
2 2691. Maggie's Leap NorthGate Brewing Minneapolis, MN Milk / Sweet Stout 16 oz. 4.9% 26
3 2690. Wall's End NorthGate Brewing Minneapolis, MN English Brown Ale 16 oz. 4.8% 19
4 2689. Pumpion NorthGate Brewing Minneapolis, MN Pumpkin Ale 16 oz. 6.0% 38

As it can be seen above, the header row is saved as a simple observation unit. Let's rename the columns with the real headers.


In [5]:
data.columns = data.iloc[0]

In [6]:
data.head()


Out[6]:
ENTRY BEER BREWERY LOCATION STYLE SIZE ABV IBUs
0 ENTRY BEER BREWERY LOCATION STYLE SIZE ABV IBUs
1 2692. Get Together NorthGate Brewing Minneapolis, MN American IPA 16 oz. 4.5% 50
2 2691. Maggie's Leap NorthGate Brewing Minneapolis, MN Milk / Sweet Stout 16 oz. 4.9% 26
3 2690. Wall's End NorthGate Brewing Minneapolis, MN English Brown Ale 16 oz. 4.8% 19
4 2689. Pumpion NorthGate Brewing Minneapolis, MN Pumpkin Ale 16 oz. 6.0% 38

Fine, but the very first row still remains. We have to drop it, and for that we will use the drop() function from the pandas library, which takes 2 arguments: the dropable row/column name and the axis (0 for rows and 1 for columns).


In [7]:
data = data.drop(0,axis=0)

In [8]:
data.head()


Out[8]:
ENTRY BEER BREWERY LOCATION STYLE SIZE ABV IBUs
1 2692. Get Together NorthGate Brewing Minneapolis, MN American IPA 16 oz. 4.5% 50
2 2691. Maggie's Leap NorthGate Brewing Minneapolis, MN Milk / Sweet Stout 16 oz. 4.9% 26
3 2690. Wall's End NorthGate Brewing Minneapolis, MN English Brown Ale 16 oz. 4.8% 19
4 2689. Pumpion NorthGate Brewing Minneapolis, MN Pumpkin Ale 16 oz. 6.0% 38
5 2688. Stronghold NorthGate Brewing Minneapolis, MN American Porter 16 oz. 6.0% 25

Let's do the same for row names. Rows are called indecies in Pandas. Thus, let's take the values from the "ENTRY" column and use them to rename rows. Then, of course, we shoudl drop the additional column too.


In [9]:
data.index = data["ENTRY"]

In [10]:
data.head()


Out[10]:
ENTRY BEER BREWERY LOCATION STYLE SIZE ABV IBUs
ENTRY
2692. 2692. Get Together NorthGate Brewing Minneapolis, MN American IPA 16 oz. 4.5% 50
2691. 2691. Maggie's Leap NorthGate Brewing Minneapolis, MN Milk / Sweet Stout 16 oz. 4.9% 26
2690. 2690. Wall's End NorthGate Brewing Minneapolis, MN English Brown Ale 16 oz. 4.8% 19
2689. 2689. Pumpion NorthGate Brewing Minneapolis, MN Pumpkin Ale 16 oz. 6.0% 38
2688. 2688. Stronghold NorthGate Brewing Minneapolis, MN American Porter 16 oz. 6.0% 25

In [11]:
data = data.drop("ENTRY",axis=1)

In [12]:
data.head()


Out[12]:
BEER BREWERY LOCATION STYLE SIZE ABV IBUs
ENTRY
2692. Get Together NorthGate Brewing Minneapolis, MN American IPA 16 oz. 4.5% 50
2691. Maggie's Leap NorthGate Brewing Minneapolis, MN Milk / Sweet Stout 16 oz. 4.9% 26
2690. Wall's End NorthGate Brewing Minneapolis, MN English Brown Ale 16 oz. 4.8% 19
2689. Pumpion NorthGate Brewing Minneapolis, MN Pumpkin Ale 16 oz. 6.0% 38
2688. Stronghold NorthGate Brewing Minneapolis, MN American Porter 16 oz. 6.0% 25

Nice, now let's clean some variables. Let's start from the SIZE. It includes information on size which is presented in oz or ounces or differently. We need to have numbers only. Let's first see what are the available options. FOr that purpose, we can convert that columns to a list and then use the set() function to get the unique values from the list.


In [13]:
data_list = data["SIZE"].tolist()
unique_values = set(data_list)
print(unique_values)


set([u'19.2 oz.', u'12 OZ.', u'12 & 16 oz.', u'19.2', u'12 oz', u'24 oz.', u'12 oz.', u'32 oz.', u'24 oz. "Silo Can"', u'16 oz', u'12 oz. Slimline', u'16.9 oz.', u'16 oz. Alumi-Tek\xae', u'12 ounce', u'16 oz.', u'8.4 oz.'])

Excellent. This means we can write a regular expression that will find all the digits (including those that have a dot inside) and subsitute whatever comes afterwards with an empty string.


In [14]:
for i in range(0,len(data['SIZE'])):
    data['SIZE'][i] = re.sub('(^.*\d)(\s*.*$)',r'\1',data['SIZE'][i])

In [15]:
data.head()


Out[15]:
BEER BREWERY LOCATION STYLE SIZE ABV IBUs
ENTRY
2692. Get Together NorthGate Brewing Minneapolis, MN American IPA 16 4.5% 50
2691. Maggie's Leap NorthGate Brewing Minneapolis, MN Milk / Sweet Stout 16 4.9% 26
2690. Wall's End NorthGate Brewing Minneapolis, MN English Brown Ale 16 4.8% 19
2689. Pumpion NorthGate Brewing Minneapolis, MN Pumpkin Ale 16 6.0% 38
2688. Stronghold NorthGate Brewing Minneapolis, MN American Porter 16 6.0% 25

Done! Let's now go for the ABV variable. It is given in %-s, so we can keep the number only, and divide it by 100 to get the float value. But there may be some wrongly inputed values in the columns also. So let's divide only those that have correct values and assign a "missing value" value to others.


In [17]:
# for all values in that columns
for i in range(0,len(data['ABV'])):
    # if match exists, which means it is a correct value
    if re.match('(^.*\d)(%)',data['ABV'][i]) is not None:
        # substitute the % sign with nothing, convert result to float and divide by 100
        data['ABV'][i] = float(re.sub('(^.*\d)(%)',r'\1',data['ABV'][i]))/100
    else: # which is when the value is incorrect
        # give it the value of "nan" which stands for missing values
        data['ABV'][i] = float("nan")

In [18]:
data['ABV'].head(100)


Out[18]:
ENTRY
2692.    0.045
2691.    0.049
2690.    0.048
2689.     0.06
2688.     0.06
2687.    0.056
2686.     0.08
2685.    0.125
2684.    0.077
2683.    0.042
2682.     0.05
2681.    0.066
2680.     0.04
2679.    0.055
2678.    0.076
2677.    0.051
2676.    0.065
2675.     0.06
2674.     0.05
2673.    0.052
2672.    0.048
2671.    0.072
2670.    0.067
2669.    0.049
2668.     0.08
2667.    0.075
2666.     0.06
2664.     0.08
2663.    0.063
2662.    0.058
         ...  
2619.    0.058
2618.     0.07
2617.    0.055
2616.    0.055
2615.    0.054
2614.    0.053
2613.    0.055
2612.     0.05
2611.     0.06
2610.    0.064
2609.    0.058
2608.    0.065
2607.    0.068
2606.    0.078
2605.    0.052
2604.    0.049
2603.    0.065
2602.     0.05
2601.     0.08
2600.    0.085
2599.    0.069
2598.    0.098
2597.     0.06
2596.    0.062
2595.      NaN
2594.    0.052
2593.     0.07
2592.    0.059
2591.    0.045
2590.    0.055
Name: ABV, dtype: object

Great! Let's now get some info on our dataframe


In [19]:
data.info()


<class 'pandas.core.frame.DataFrame'>
Index: 2410 entries, 2692. to 1.
Data columns (total 7 columns):
BEER        2410 non-null object
BREWERY     2410 non-null object
LOCATION    2410 non-null object
STYLE       2405 non-null object
SIZE        2410 non-null object
ABV         2348 non-null object
IBUs        1405 non-null object
dtypes: object(7)
memory usage: 150.6+ KB

As you can see the ABV guy is left with only 2348 values out of 2410, as we assigned "nan" to incorrect values. Let's impute those missing values. As it is a variable with integer values,we can impute with mean using the fillna() function from pandas.'


In [20]:
data['ABV'] = data['ABV'].fillna(data['ABV'].mean())

In [21]:
data.info()


<class 'pandas.core.frame.DataFrame'>
Index: 2410 entries, 2692. to 1.
Data columns (total 7 columns):
BEER        2410 non-null object
BREWERY     2410 non-null object
LOCATION    2410 non-null object
STYLE       2405 non-null object
SIZE        2410 non-null object
ABV         2410 non-null float64
IBUs        1405 non-null object
dtypes: float64(1), object(6)
memory usage: 150.6+ KB

Done! But there is another variable with missing values: IBUs. Let's make an imputation for that one also, but this time instead of mean let's use the backward/forward filling method.


In [22]:
data['IBUs'] = data['IBUs'].fillna(method = "bfill")

In [23]:
data.info()


<class 'pandas.core.frame.DataFrame'>
Index: 2410 entries, 2692. to 1.
Data columns (total 7 columns):
BEER        2410 non-null object
BREWERY     2410 non-null object
LOCATION    2410 non-null object
STYLE       2405 non-null object
SIZE        2410 non-null object
ABV         2410 non-null float64
IBUs        2410 non-null object
dtypes: float64(1), object(6)
memory usage: 150.6+ KB